2023-10-19
The idea of a p-value as one possible summary of evidence
morphed into a
rule for authors: reject the null hypothesis if p < .05, which morphed into a
rule for editors: reject the submitted article if p > .05, which morphed into a
rule for journals: reject all articles that report p-values.
Bottom line: Reject rules. Ideas matter.
2016
2019
… a world learning to venture beyond “p < 0.05”
This is a world where researchers are free to treat “p = 0.051” and “p = 0.049” as not being categorically different, where authors no longer find themselves constrained to selectively publish their results based on a single magic number.
In this world, where studies with “p < 0.05” and studies with “p > 0.05” are not automatically in conflict, researchers will see their results more easily replicated – and, even when not, they will better understand why.
The 2016 ASA Statement on P-Values and Statistical Significance started moving us toward this world. As of the date of publication of this special issue, the statement has been viewed over 294,000 times and cited over 1700 times, an average of about 11 citations per week since its release. Now we must go further.
The ASA Statement (2016) was mostly about what not to do.
The 2019 effort represents an attempt to explain what to do.
Some of you exploring this special issue of The American Statistician might be wondering if it’s a scolding from pedantic statisticians lecturing you about what not to do with p-values, without offering any real ideas of what to do about the very hard problem of separating signal from noise in data and making decisions under uncertainty. Fear not. In this issue, thanks to 43 innovative and thought-provoking papers from forward-looking statisticians, help is on the way.
If you’re just arriving to the debate, here’s a sampling of what not to do.
http://jamanetwork.com/journals/jamaotolaryngology/fullarticle/2546529
A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time…. so it’s worth examining the prevalence of this error. Consider, for example, this division:
Now consider some typical p-values in these ranges: say, p = .005, p = .03, p = .08, and p = .2.
Translate these two-sided p-values back into z-scores…
| Description | really sig. | sig. | marginally sig. | not at all sig. |
|---|---|---|---|---|
| p-value | 0.005 | 0.03 | 0.08 | 0.20 |
| z-score | 2.8 | 2.2 | 1.8 | 1.3 |
The seemingly yawning gap between the “not at all significant” p-value of .2 and the “really significant” p-value of .005 corresponds to a difference in z-scores of only 1.5.
If you had two independent experiments with z-scores of 2.8 and 1.3 and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4, which is completely consistent with noise.
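A quick sketch of that arithmetic (in Python with scipy; the four p-values and the equal-standard-error assumption come from the passage above, everything else is illustrative):

```python
from scipy import stats

p_values = [0.005, 0.03, 0.08, 0.20]

# translate each two-sided p-value back into a z-score
z_scores = [stats.norm.isf(p / 2) for p in p_values]
print([round(z, 1) for z in z_scores])            # [2.8, 2.2, 1.8, 1.3]

# compare the "really significant" and "not at all significant" results,
# treating them as independent estimates with equal standard errors (SE = 1)
z_diff = z_scores[0] - z_scores[-1]               # about 1.5
se_diff = (1 ** 2 + 1 ** 2) ** 0.5                # about 1.4
p_diff = 2 * stats.norm.sf(z_diff / se_diff)      # about 0.28: consistent with noise
print(round(z_diff, 1), round(se_diff, 1), round(p_diff, 2))
```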
From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value can only be interpreted in the context of the null hypothesis of zero effect, and (much of the time), nobody’s interested in the null hypothesis.
Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.
From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).
The key point: The difference between statistically significant and NOT statistically significant is not, generally, statistically significant.
ASA Statement: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”
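One way to see that definition in action is a small simulation; the model and numbers below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(431)

# specified statistical model: two groups of 30, Normal(0, 1), no true difference
n_per_group, sd = 30, 1.0
observed_diff = 0.5          # a hypothetical observed mean difference

# simulate the statistical summary (the sample mean difference) under that model
sims = rng.normal(0, sd, size=(10_000, 2, n_per_group))
null_diffs = sims[:, 0, :].mean(axis=1) - sims[:, 1, :].mean(axis=1)

# p-value: how often the summary is equal to or more extreme than the observed value
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(p_value)               # roughly 0.05 for these made-up numbers
```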
“Not Even Scientists Can Easily Explain p Values” at fivethirtyeight.com
… Try to distill the p-value down to an intuitive concept and it loses all its nuances and complexity, said science journalist Regina Nuzzo, a statistics professor at Gallaudet University. “Then people get it wrong, and this is why statisticians are upset and scientists are confused.” You can get it right, or you can make it intuitive, but it’s all but impossible to do both.
“Statisticians found one thing they can agree on” at fivethirtyeight.com
A significant effect is not necessarily the same thing as an interesting effect. For example, results calculated from large samples are nearly always “significant” even when the effects are quite small in magnitude. Before doing a test, always ask if the effect is large enough to be of any practical interest. If not, why do the test?
A non-significant effect is not necessarily the same thing as no difference. A large effect of real practical interest may still produce a non-significant result simply because the sample is too small.
There are assumptions behind all statistical inferences. Checking assumptions is crucial to validating the inference made by any test or confidence interval.
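A toy simulation of the first two points above (all numbers are invented for illustration): with a very large sample, a trivially small effect turns up “significant,” while a practically large effect in a small sample often does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(431)

# tiny effect (0.02 SD), enormous samples: "significant," but of little practical interest
big_a = rng.normal(0.00, 1, 200_000)
big_b = rng.normal(0.02, 1, 200_000)
print(stats.ttest_ind(big_a, big_b).pvalue)       # typically well below 0.05

# large effect (0.8 SD), tiny samples: often "not significant" despite a real difference
small_a = rng.normal(0.0, 1, 8)
small_b = rng.normal(0.8, 1, 8)
print(stats.ttest_ind(small_a, small_b).pvalue)   # often above 0.05
```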
“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”
ASA 2016 statement on p values
“For decades, the conventional p-value threshold has been 0.05,” says Dr. Paul Wakim, chief of the biostatistics and clinical epidemiology service at the National Institutes of Health Clinical Center, “but it is extremely important to understand that this 0.05, there’s nothing rigorous about it. It wasn’t derived from statisticians who got together, calculated the best threshold, and then found that it is 0.05. No, it’s Ronald Fisher, who basically said, ‘Let’s use 0.05,’ and he admitted that it was arbitrary.”
“People say, ‘Ugh, it’s above 0.05, I wasted my time.’ No, you didn’t waste your time,” says Dr. Wakim. “If the research question is important, the result is important. Whatever it is.”
The p-value is the most widely known statistic. P-values are reported in a large majority of scientific publications that measure and report data. R.A. Fisher is widely credited with inventing the p-value. If he were cited every time a p-value was reported, his paper would have, at the very least, 3 million citations, making it the most highly cited paper of all time.
What do you suppose the distribution of those p values is going to look like?
There are a lot of candidates for the most outrageous misuse of “statistical significance” out there.
In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
[I]t is unacceptably easy to publish statistically significant evidence consistent with any hypothesis.
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?
… It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields statistical significance, and to then report only what worked. The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
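A minimal sketch of that arithmetic (illustrative Python, not from the paper): even when every individual test uses alpha = 0.05, picking the best of five analyses in a null world produces a “finding” far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(431)
n_sims, n_per_group, n_analyses = 2_000, 20, 5

false_positives = 0
for _ in range(n_sims):
    # null world: no true effect, but five analytic alternatives
    # (here, simply five independent outcome measures)
    a = rng.normal(0, 1, size=(n_analyses, n_per_group))
    b = rng.normal(0, 1, size=(n_analyses, n_per_group))
    p_vals = [stats.ttest_ind(a[i], b[i]).pvalue for i in range(n_analyses)]
    false_positives += min(p_vals) < 0.05          # report "whatever worked"

print(false_positives / n_sims)   # close to 1 - 0.95**5 = 0.23, not 0.05
```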
For more, see
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or p-hacking and the research hypothesis was posited ahead of time
Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.
“In response to recommendations to redefine statistical significance to \(p \leq .005\), we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.”
Gelman blog 2017-09-26 on “Abandon Statistical Significance”
“Measurement error and variation are concerns even if your estimate is more than 2 standard errors from zero. Indeed, if variation or measurement error are high, then you learn almost nothing from an estimate even if it happens to be ‘statistically significant.’”
Read the whole paper here
The American Statistician Volume 73, 2019, Supplement 1
Articles on:
We can make acceptance of uncertainty more natural to our thinking by accompanying every point estimate in our research with a measure of its uncertainty such as a standard error or interval estimate. Reporting and interpreting point and interval estimates should be routine.
How will accepting uncertainty change anything? To begin, it will prompt us to seek better measures, more sensitive designs, and larger samples, all of which increase the rigor of research.
It also helps us be modest … [and] leads us to be thoughtful.
The nexus of openness and modesty is to report everything while at the same time not concluding anything from a single study with unwarranted certainty. Because of the strong desire to inform and be informed, there is a relentless demand to state results with certainty. Again, accept uncertainty and embrace variation in associations and effects, because they are always there, like it or not. Understand that expressions of uncertainty are themselves uncertain. Accept that one study is rarely definitive, so encourage, sponsor, conduct, and publish replication studies.
Be modest by encouraging others to reproduce your work. Of course, for it to be reproduced readily, you will necessarily have been thoughtful in conducting the research and open in presenting it.
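As a concrete version of “accompany every point estimate with a measure of its uncertainty,” here is a small sketch with made-up data (the groups, sample sizes, and values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(431)
group_a = rng.normal(10.0, 2.0, 40)    # hypothetical measurements
group_b = rng.normal(10.8, 2.0, 40)

diff = group_b.mean() - group_a.mean()
se = np.sqrt(group_a.var(ddof=1) / 40 + group_b.var(ddof=1) / 40)
lo, hi = diff - 1.96 * se, diff + 1.96 * se       # approximate 95% interval

# report the estimate together with its uncertainty, not just a p-value
print(f"estimated difference {diff:.2f}, SE {se:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```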
When stuck in a design, I think about how to get better data.
When stuck in an analysis, I try to turn a table into a graph.
431 Class 16 | 2023-10-19 | https://thomaselove.github.io/431-2023/